Red Wine Quality

Univariate Plots Section

The data set structure:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Summary of the data set:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The variable X is just a running index of the observations, so we drop it for the sake of clarity.

redwine$X <- NULL

Let’s explore each individual variable. We start with quality since it is the main feature of interest.

quality

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The distribution of this variable looks approximately normal with a slight right skew: 217 wines scored 7 or 8, while only 63 scored 3 or 4. More than 1300 wines (over 80% of all wines) received a score of either 5 or 6.
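The counts above are enough to check these figures without the original file. The following is a minimal sketch that rebuilds the quality vector from the printed frequency table; the variable names are chosen here for illustration and are not from the original analysis.

```r
# Rebuild the quality scores from the frequency table printed above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

n_wines <- length(quality)                  # total number of wines
mean_quality <- mean(quality)               # compare with the summary above
share_5_or_6 <- mean(quality %in% c(5, 6))  # fraction of wines scored 5 or 6

print(table(quality))
cat(sprintf("n = %d, mean = %.3f, share of 5 or 6 = %.1f%%\n",
            n_wines, mean_quality, 100 * share_5_or_6))
```

The share works out to 1319 of 1599 wines, or about 82.5%.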

fixed.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

This variable ranges from 4.6 to 15.9 with a mode of 7.2. The distribution looks approximately normal with a right skew. In addition, we can see a few outliers to the right.

volatile.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

This plot has similar characteristics to the previous one. There are several outliers to the right, and the distribution again looks approximately normal with a right tail. However, this variable has less variance than the fixed.acidity feature.

citric.acid

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

It is hard to tell what kind of distribution this is. The mode at 0 is striking, as well as the outlier at 1.

residual.sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of this variable has a long tail, so we apply a log10 transformation on the x-axis.

The result shows a right-skewed normal distribution.
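To see why the log10 scale helps, one can compare how far the upper end of the distribution stretches relative to the lower end, before and after the transformation. This is a rough sketch based only on the five-number summary printed above; the helper `tail_ratio` is a hypothetical name introduced for this illustration, and taking the log of the quantiles only approximates the quantiles of the log-transformed data.

```r
# Five-number summary of residual.sugar as printed above.
five_num <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Length of the upper stretch (Q3 to max) relative to the
# lower stretch (min to Q1). A value far above 1 signals a long right tail.
tail_ratio <- function(x) {
  unname((x["max"] - x["q3"]) / (x["q1"] - x["min"]))
}

ratio_raw <- tail_ratio(five_num)           # on the original scale
ratio_log10 <- tail_ratio(log10(five_num))  # on the log10 scale

cat(sprintf("raw: %.2f, log10: %.2f\n", ratio_raw, ratio_log10))
```

The ratio shrinks considerably on the log10 scale but stays above 1, which is consistent with a transformed distribution that is still right-skewed.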

chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

It looks like this distribution has a tail to the right, with some extreme outliers beyond it. We display two additional plots for comparison. The first uses a log10 transformation, while the second cuts off the top 3% of chlorides values.

The distributions are similar, i.e. approximately normal with some tail to the right.
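The plotting code is not shown in this report, so the following is only a sketch of how the two variants could be prepared. It uses the first ten chlorides values from the structure listing as a stand-in for the full column, and it assumes the cut-off was taken at the 97th percentile.

```r
# First ten chlorides values from the structure listing (illustration only).
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076,
               0.075, 0.069, 0.065, 0.073, 0.071)

# Variant 1: values on a log10 scale.
chlorides_log10 <- log10(chlorides)

# Variant 2: keep only values at or below the 97th percentile,
# which drops the top 3% of observations.
cutoff <- quantile(chlorides, probs = 0.97)
chlorides_trimmed <- chlorides[chlorides <= cutoff]

cat("kept", length(chlorides_trimmed), "of", length(chlorides), "values\n")
```

With only ten values the cut removes just the single largest observation; on the full data set it would remove roughly 48 wines.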

free.sulfur.dioxide

## 
##    1    2    3    4    5  5.5    6    7    8    9   10   11   12   13   14 
##    3    1   49   41  104    1  138   71   56   62   79   59   75   57   50 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   78   61   60   46   39   30   41   22   32   34   24   32   29   23   23 
##   30   31   32   33   34   35   36   37 37.5   38   39   40 40.5   41   42 
##   16   20   22   11   18   15   11    3    2    9    5    6    1    7    3 
##   43   45   46   47   48   50   51   52   53   54   55   57   66   68   72 
##    3    3    1    1    4    2    4    3    1    1    2    1    1    2    1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

This variable follows a right-skewed distribution with some outliers to the right.

total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Again, a right-skewed distribution, this time with two extreme outliers. Let’s refine the plot by removing the outliers and adjusting the bin width.

density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

This is the first variable in this data set that follows an almost textbook-like normal distribution.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Another normally distributed variable, with some outliers on both the left and the right.

sulphates

## 
## 0.33 0.37 0.39  0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##    1    2    6    4    5    8   16   12   18   19   29   31   27   26   47 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##   51   68   50   60   55   68   51   69   45   61   48   46   41   42   36 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   35   23   33   26   28   26   26   20   25   26   23   18   19   15   22 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 
##   15   13   14   13   13    7    7    8    8    5   10    4    2    3    6 
## 0.98 0.99    1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09  1.1 1.11 1.12 
##    2    3    1    1    3    2    2    3    4    2    3    1    2    1    1 
## 1.13 1.14 1.15 1.16 1.17 1.18  1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56 
##    2    2    1    1    5    3    1    1    1    2    1    1    1    3    1 
## 1.59 1.61 1.62 1.95 1.98    2 
##    1    1    1    2    1    1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

A right-skewed distribution with some outliers to the right. We exclude them for our refined plot.

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of this variable is right-skewed.

Univariate Analysis

What is the structure of your dataset?

The data set contains 1599 observations with 12 variables on the properties of the wine. Quality is stored as an integer score, i.e. a discrete, ordered rating, while the remaining 11 features are continuous numerical measurements.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality, an integer variable measured from 0 (worst) to 10 (best). There are no wines rated with a quality of 0, 1, 2, 9, or 10 in this particular data set. We will examine if and how the other features influence the quality of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Intuitively, I assume that alcohol, residual.sugar, and acidity have the most influence on the quality of a wine. First, alcohol serves as a flavor carrier. Second, when tasting wine you notice its sweetness and sourness first and foremost.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I find the distribution of citric.acid quite unusual. It does not follow a clear distribution and has its mode at 0. Alcohol has several spikes within its distribution, which is unexpected to me. I would have assumed a much smoother distribution. Maybe winemakers target very specific alcohol levels during vinification.

Fortunately, the data set was already tidy. I removed the X variable, because it served just as a numbering for each observation.

Bivariate Plots Section

We examine the correlation between each pair of variables.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Let’s visualize these values in a correlation matrix:

Quality has the strongest correlation with alcohol, followed by volatile.acidity, sulphates, and citric.acid. Volatile.acidity is the only feature in this group that has a negative correlation with quality. Surprisingly, residual.sugar has no correlation with quality and little to none with the other features.
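The ranking can be read off the quality column of the correlation table. As a small sketch, the values from that column are copied into a named vector and sorted by absolute value; the vector name is made up for this example.

```r
# Correlations with quality, copied from the table above.
cor_with_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by strength of the relationship, ignoring the sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value keeps negative correlations such as volatile.acidity in their proper place in the ranking.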

Maybe a scatterplot matrix of the variables of interest can give us more insights:

The scatterplots nicely illustrate the lack of a relationship between residual.sugar and the other variables, in particular quality. With the help of the linear smoothing functions, we can also visually identify the influence of the variables on quality.

residual.sugar and quality

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$residual.sugar and redwine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

There is relatively low variation of residual sugar. The median of residual.sugar is roughly 2 for each quality score.

alcohol and quality

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$alcohol and redwine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The boxplot shows that the median percent alcohol content ranges from 9.7% up to 12.15%. It gets higher as the quality increases. The median for wines with quality of 5 or below is around 10%.
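The test statistic and confidence interval in the output above follow directly from the correlation coefficient and the sample size. The sketch below recomputes them with the standard formulas, the t statistic for a Pearson correlation and the Fisher z-transformation for the interval; the variable names are illustrative.

```r
r <- 0.4761663  # reported correlation between alcohol and quality
n <- 1599       # number of wines
df <- n - 2     # degrees of freedom of the test

# t statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat(sprintf("t = %.3f, 95%% CI = [%.4f, %.4f]\n", t_stat, ci[1], ci[2]))
```

The results should agree with the reported t = 21.639 and the interval from 0.4374 to 0.5132 up to rounding.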

volatile.acidity and quality

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$volatile.acidity and redwine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Volatile.acidity shows an inverse relation with quality. The higher the quality, the lower the median volatile.acidity. The median ranges from 0.37 to 0.845.

sulphates and quality

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$sulphates and redwine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

citric.acid and quality

## 
##  Pearson's product-moment correlation
## 
## data:  redwine$citric.acid and redwine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

Both sulphates and citric.acid have a positive relation with quality. The relation is less distinct compared to alcohol and volatile.acidity. In addition to that, we observe that the variance within citric.acid is notably higher than within the sulphates variable.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates most strongly with alcohol (0.48), followed by volatile.acidity (-0.39). There is also a weak correlation between quality and sulphates (0.25) and between quality and citric.acid (0.23).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What I found particularly surprising is the lack of correlation between quality and residual.sugar. As stated before, I assumed sugar is important for the taste of the wine and therefore its quality rating. Moreover, residual.sugar has little to no correlation with the other features.

What was the strongest relationship you found?

The strongest correlation is between pH and fixed.acidity, at -0.68. This is not surprising, since pH measures how acidic or basic a wine is, and a higher acid content lowers the pH.
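Finding the strongest pair in a correlation matrix means searching the off-diagonal entries for the largest absolute value. The sketch below does this on a 4 x 4 subset of the table above, rounded to four decimals; it only demonstrates the technique and does not scan the full matrix.

```r
# Subset of the correlation table above (values rounded to 4 decimals).
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cors <- matrix(c(
   1.0000,  0.6717,  0.6680, -0.6830,
   0.6717,  1.0000,  0.3649, -0.5419,
   0.6680,  0.3649,  1.0000, -0.3417,
  -0.6830, -0.5419, -0.3417,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the upper triangle so that each pair is counted once
# and the diagonal of ones is ignored.
abs_upper <- abs(cors)
abs_upper[!upper.tri(abs_upper)] <- NA

idx <- which(abs_upper == max(abs_upper, na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(cors)[idx[1, "row"]], colnames(cors)[idx[1, "col"]])
strongest_value <- cors[idx[1, "row"], idx[1, "col"]]

cat(strongest_pair, ":", strongest_value, "\n")
```

The same approach applies unchanged to the full matrix returned by `cor()`.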

Multivariate Plots Section

We identified four features having the strongest correlation with quality in the previous section. Now we investigate the combinations of two of these features and their influence on quality.

alcohol vs volatile.acidity and quality

It is hard to distinguish different quality scores, thus we use a better color scheme.

##                     alcohol volatile.acidity    quality
## alcohol           1.0000000       -0.2022880  0.4761663
## volatile.acidity -0.2022880        1.0000000 -0.3905578
## quality           0.4761663       -0.3905578  1.0000000

Better wines are in the bottom right of the plot (high alcohol, low volatile acidity), while poorly rated wines tend toward the top left (low alcohol, high volatile acidity).

alcohol vs sulphates and quality

##              alcohol  sulphates   quality
## alcohol   1.00000000 0.09359475 0.4761663
## sulphates 0.09359475 1.00000000 0.2513971
## quality   0.47616632 0.25139708 1.0000000

Both alcohol and sulphates have a positive correlation with quality, confirming our previous findings.

alcohol vs citric.acid and quality

##               alcohol citric.acid   quality
## alcohol     1.0000000   0.1099032 0.4761663
## citric.acid 0.1099032   1.0000000 0.2263725
## quality     0.4761663   0.2263725 1.0000000

A similar plot as before, though with much more variance on the y-axis.

volatile.acidity vs sulphates and quality

##                  volatile.acidity  sulphates    quality
## volatile.acidity        1.0000000 -0.2609867 -0.3905578
## sulphates              -0.2609867  1.0000000  0.2513971
## quality                -0.3905578  0.2513971  1.0000000

volatile.acidity vs citric.acid and quality

##                  volatile.acidity citric.acid    quality
## volatile.acidity        1.0000000  -0.5524957 -0.3905578
## citric.acid            -0.5524957   1.0000000  0.2263725
## quality                -0.3905578   0.2263725  1.0000000

This plot is less clear than the others. Given the similar nature of volatile.acidity and citric.acid we assume a strong correlation between them. A quick calculation confirms our suspicion: these variables are correlated with a value of -0.55.

sulphates vs citric.acid and quality

##             sulphates citric.acid   quality
## sulphates   1.0000000   0.3127700 0.2513971
## citric.acid 0.3127700   1.0000000 0.2263725
## quality     0.2513971   0.2263725 1.0000000

Before we build our regression model based on these features, we have another look at the corresponding correlation table.

##                      alcohol volatile.acidity   sulphates citric.acid
## alcohol           1.00000000       -0.2022880  0.09359475   0.1099032
## volatile.acidity -0.20228803        1.0000000 -0.26098669  -0.5524957
## sulphates         0.09359475       -0.2609867  1.00000000   0.3127700
## citric.acid       0.10990325       -0.5524957  0.31277004   1.0000000

I would argue for leaving citric.acid out of our model, because it is quite strongly correlated with volatile.acidity (-0.55) and, to a lesser degree, with sulphates (0.31). We will build four different models, adding one more feature each time, and include citric.acid in the last one nonetheless to check our intuition.
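One way to quantify how much the predictors overlap is the variance inflation factor (VIF), which is not part of the original analysis and is added here as an assumption-laden sketch. For standardized predictors, the VIFs are the diagonal entries of the inverse of the predictor correlation matrix, so they can be computed from the table above alone.

```r
# Predictor correlation matrix, copied from the table above.
preds <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")
R <- matrix(c(
   1.00000000, -0.2022880,  0.09359475,  0.1099032,
  -0.20228803,  1.0000000, -0.26098669, -0.5524957,
   0.09359475, -0.2609867,  1.00000000,  0.3127700,
   0.10990325, -0.5524957,  0.31277004,  1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(preds, preds))

# VIF of each predictor: diagonal of the inverse correlation matrix.
# A value of 1 means the predictor is uncorrelated with the others.
vif <- setNames(diag(solve(R)), preds)
print(round(vif, 2))
```

Each VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining three, so larger values indicate more redundancy.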

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = redwine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = redwine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = redwine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = redwine)
## 
## ================================================================
##                        m1         m2         m3         m4      
## ----------------------------------------------------------------
##   (Intercept)        1.875***   3.095***   2.611***   2.646***  
##                     (0.175)    (0.184)    (0.196)    (0.201)    
##   alcohol            0.361***   0.314***   0.309***   0.309***  
##                     (0.017)    (0.016)    (0.016)    (0.016)    
##   volatile.acidity             -1.384***  -1.221***  -1.265***  
##                                (0.095)    (0.097)    (0.113)    
##   sulphates                                0.679***   0.696***  
##                                           (0.101)    (0.103)    
##   citric.acid                                        -0.079     
##                                                      (0.104)    
## ----------------------------------------------------------------
##   R-squared              0.2        0.3        0.3        0.3   
##   adj. R-squared         0.2        0.3        0.3        0.3   
##   sigma                  0.7        0.7        0.7        0.7   
##   F                    468.3      370.4      268.9      201.8   
##   p                      0.0        0.0        0.0        0.0   
##   Log-likelihood     -1721.1    -1621.8    -1599.4    -1599.1   
##   Deviance             805.9      711.8      692.1      691.9   
##   AIC                 3448.1     3251.6     3208.8     3210.2   
##   BIC                 3464.2     3273.1     3235.7     3242.4   
##   N                   1599       1599       1599       1599     
## ================================================================

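As assumed, citric.acid does not improve our model significantly: its coefficient (-0.079, standard error 0.104) is not significant, and the AIC rises slightly from 3208.8 (m3) to 3210.2 (m4). Interestingly, all models have a low R-squared value of at most 0.3.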
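The information criteria in the table can be recomputed from the log-likelihoods, which makes the comparison between the models easier to follow. This sketch uses the standard definitions AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n), where k counts the regression coefficients plus the residual variance; small deviations from the table are due to the rounded log-likelihoods.

```r
# Log-likelihoods and AIC values as reported in the model table.
log_lik <- c(m1 = -1721.1, m2 = -1621.8, m3 = -1599.4, m4 = -1599.1)
aic_reported <- c(m1 = 3448.1, m2 = 3251.6, m3 = 3208.8, m4 = 3210.2)

n <- 1599                                      # number of observations
n_coef <- c(m1 = 2, m2 = 3, m3 = 4, m4 = 5)    # intercept plus predictors
k <- n_coef + 1                                # plus the residual variance

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k

print(round(rbind(aic, bic), 1))
cat("lowest AIC:", names(which.min(aic)),
    "| lowest BIC:", names(which.min(bic)), "\n")
```

Lower values are better for both criteria, and both penalize the extra parameter that citric.acid adds in m4.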

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

All multivariate plots confirm the relationships found earlier. Alcohol in particular serves as a good indicator for quality, one reason being its low correlation with the other predictors.

Were there any interesting or surprising interactions between features?

The plot with volatile.acidity vs citric.acid was surprising, since all the other plots showed a stronger trend. However, the surprise was quickly gone after realizing the similarity and correlation between both variables.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I built different models incorporating successively alcohol, volatile.acidity, sulphates, and citric.acid. All these models perform rather badly in predicting wine quality. If I had to choose one model it would be m2, which uses alcohol and volatile.acidity as input variables. I prefer it over m3, because it has similar performance while being simpler.
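The model table rounds R-squared to one decimal, which hides the differences between the models. As a sketch, more digits can be recovered from the reported F statistics with the identity R² = pF / (pF + df), where p is the number of predictors and df the residual degrees of freedom; the variable names are illustrative.

```r
# F statistics as reported in the model table.
f_stat <- c(m1 = 468.3, m2 = 370.4, m3 = 268.9, m4 = 201.8)
p <- c(m1 = 1, m2 = 2, m3 = 3, m4 = 4)  # number of predictors per model
n <- 1599                               # number of observations
df_resid <- n - p - 1                   # residual degrees of freedom

# Invert F = (R^2 / p) / ((1 - R^2) / df_resid) to get R^2.
r_squared <- (f_stat * p) / (f_stat * p + df_resid)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

print(round(rbind(r_squared, adj_r_squared), 3))
```

This gives R-squared values of roughly 0.23, 0.32, 0.34, and 0.34 for m1 to m4, so the gain from adding sulphates is about two percentage points and the gain from citric.acid is negligible.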


Final Plots and Summary

Plot One

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Description One

Quality, our main feature of interest, follows an approximately normal distribution with a slight right skew. It has low variance, because over 80% of all wines are rated either 5 or 6. In addition, the data set contains no wines with a score of 0, 1, 2, 9, or 10.

Plot Two

## redwine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## redwine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## redwine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## redwine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## redwine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## redwine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

Description Two

A surprising result: residual.sugar has practically no linear correlation with quality (r = 0.01, p = 0.58). The median amount of sugar is just above 2 grams per litre across all quality scores. Most wines have a residual.sugar value between 1.2 and 3.5 grams per litre.

Plot Three

## Source: local data frame [6 x 2]
## 
##   quality         COR
##     (int)       (dbl)
## 1       3  0.71739157
## 2       4  0.07559736
## 3       5 -0.01611186
## 4       6 -0.10765046
## 5       7  0.01889942
## 6       8  0.53271120

Description Three

Alcohol and volatile.acidity are the two features with the strongest correlation with quality. The best rated wines sit in the bottom right of the plot, meaning high alcohol percentage and low volatile acidity. Looking at the correlation between alcohol and volatile.acidity within each quality score, we see that, except for scores of 3 and 8, the two predictors are practically uncorrelated, which is a desirable property for predictors in a linear model.
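A per-group correlation like the one in the table above can be computed by splitting the data by quality score. The code behind the table is not shown, so this is a base R sketch on the first ten rows from the structure listing; the data frame `wines`, the helper `group_cor`, and the minimum group size of 3 are assumptions made for this example.

```r
# First ten rows from the structure listing (illustration only).
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Correlation within one group; groups with fewer than min_n rows
# return NA because a correlation on 1 or 2 points is not meaningful.
group_cor <- function(d, min_n = 3) {
  if (nrow(d) < min_n) return(NA_real_)
  cor(d$alcohol, d$volatile.acidity)
}

cor_by_quality <- sapply(split(wines, wines$quality), group_cor)
print(cor_by_quality)
```

The small-group guard matters on the full data as well: only 10 wines have a score of 3 and only 18 have a score of 8, so the correlations for those two groups rest on very few observations.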


Reflection

This project was very interesting and challenging. It was more time consuming than anticipated. Though this project emphasizes exploration (quick and dirty), it took some time to write down my thought process and polish it.

I was a bit disappointed with the data set itself. Mainly, I wish it had contained more observations covering all ratings. Secondly, I was under the impression that several obvious features were missing, e.g. age of the wine, climate data, grape variety, or winemaker. Not surprisingly, this led to poor results when building the linear regression models.

For future work, I would search for a richer data set with more observations and more variables. Additionally, a different machine learning approach like logistic regression or decision trees seems more appropriate to predict quality. In the end, my exploration could not uncover distinct linear relationships. Another area of improvement could be feature transformation, which I did not apply at all.

So what is the main takeaway? Whenever you want to buy a good bottle of red wine, look out for high alcohol content.
